Abstract—In today’s Web 2.0 era, online social media has become an integral part of our lives. In the course of the information
revolution, information has changed radically in form, from simple text to today’s integrated video, image, text and audio. The way
information is disseminated and accessed has changed as well: people no longer rely only on traditional media to receive information
passively, but actively and selectively obtain it from social media. It has therefore become a great challenge to use this massive,
integrated multimodal media information effectively and to build an effective system for retrieval, browsing, analysis and usage.
Unlike movies and traditional long-form video, micro-videos are short, usually between a few seconds and tens of seconds, which
lets users browse different content quickly and make full use of the fragmented time in their lives. Users can also share their
micro-videos with friends or the public, forming a distinctive mode of social interaction. Video contains rich multimodal information,
and fusing information from multiple modalities can improve the accuracy of video classification. For the micro-video classification
task, we propose a new combinatorial network model that combines the discrete features of each modality into an overall feature for
that modality, and then fuses the modal features into an overall video feature used for classification. Experiments on a public
dataset demonstrate the effectiveness of the proposed model.


Index Terms—Dictionary learning, Recommender system, Personalized recommendation, Multimodal, Cluster. 


1 INTRODUCTION 


With the rapid development of information technology, the 
form of media data we receive has changed from single 
text data to multimodal data with more vivid forms and 
richer contents. At the same time, the popularity of various 
digital information collection devices and the Internet has 
made micro-video, a form of data filmed, produced and
shared by users, one of the fastest-emerging and most popular forms. In
today’s Web 2.0 era, online social media has become an 
integral part of our lives. In the course of the information 
revolution, the form of information has undergone a radical 
change, from simple text information to today’s integrated 
video, image, text and audio, and there has also been a great 
change in the way of dissemination and access, as people 
nowadays do not just rely on traditional media to passively 
receive information, but more actively and selectively obtain 
information from social media. Therefore, it has become a 
great challenge for us to effectively utilize these massive 
and integrated multi-modal media information to form an 
effective system of retrieval, browsing, analysis and usage. 
Unlike movies and traditional long-form video content, 
micro-videos are usually short in length, between a few 
seconds and tens of seconds, which allows users to 
quickly browse different contents and make full use of the 
fragmented time in their lives, while users can also share
their micro-videos with friends or the public, forming a
distinctive mode of social interaction. However, these features
also bring disadvantages. The short length and large
quantity of micro-videos make the available information
overwhelming and complicated, so users find it difficult to
locate the content they are interested in and often spend a
lot of time searching for quality content. Helping users find
more interesting content in less time is the core problem that
every social media platform needs to solve. If platforms
rely on traditional search engines in a passive push mode,
they can only match on the title, keywords and similar
metadata, which do not cover the full content of a
micro-video, and thus usually cannot push the right content
to users. For video authors, this also shifts attention toward
crafting titles, which clearly deviates from the original
purpose of micro-videos: convenience and sharing.
Therefore, platforms now rely more on personalized
recommendation systems to actively push content to users. In
this process, how to push high-quality, interesting content
to users while filtering out unpopular content as much
as possible is an important and meaningful research task.

For the study of micro-videos, Wang et al. [1] studied 
the characteristics of teaching-related micro-videos and 
their semantic representation, while focusing on the 
comparative study of the relevance between different 
teaching micro-videos; Zhang et al. [2] proposed a micro- 
video segmentation method based on color histogram and 
local optimization. Redi et al. [3] conducted a study on the 
creativity of micro-video data on social media platforms by 
proposing a definition of creativity, constructing a small 
dataset on the creativity of micro-videos, and extracting 
various features to conduct experiments on creativity 
prediction. Sano et al. [4] studied the characteristics
of looped playback of micro-videos. Nguyen et al. [5]
investigated the intrinsic characteristics of micro-video data,
such as multi-view shooting and multi-clip splicing. The
authors argued that micro-videos are inherently
“ready-for-analysis” due to their length limitation and the fact
that each frame carries much more important information
than a frame of a traditional long video. They constructed a
large open dataset of micro-videos, classified according to
viewpoints and tags, and combined convolutional neural
networks with gesture recognition techniques to study
micro-video understanding. As for the topic of popularity prediction,
it has attracted the attention of many researchers due 
to its great potential for commercial applications such 
as advertising placement [6], and most of the previous 
research works have been focused on text [7], image [8] and 
traditional video [9, 10]. 

With the advent of Convolutional Neural Networks
(CNN) [11, 12, 13, 14, 15], researchers no longer need to design
feature descriptors manually; networks automatically learn
semantic features and understand image content, and
have achieved great success in image classification,
detection and retrieval. Deep-learning-based micro-video
recognition [16] has surpassed the
iDT algorithm, so such traditional methods have gradually
faded from view. Facebook further proposed
using 3D convolutional neural networks to extract
spatio-temporal information [17], which injected new vitality into
the research of video classification.

Researchers have also tried to apply recurrent neural
networks (RNNs) to video temporal modeling. Such methods
usually first extract video frame
features using CNNs and then input these features into
RNNs [18] in temporal order for temporal modeling [19].
LSTM [20, 21] is a commonly used
RNN model for modeling long-term dependencies in video.
For example, Ng et al. [22] used LSTM to fuse video frame
and optical flow features, and experimentally verified the
robustness of LSTM networks to optical flow noise and
the effectiveness of LSTM for fusing video sequence
features. Video contains rich multimodal information, and
fusing information from multiple modalities can improve
the accuracy of video classification. The main contributions
of this paper are as follows:


• We study methods for extracting visual, audio and title
features from videos. Visual information is extracted
by dividing the video into multiple frames and passing
each frame through a pre-trained network. Audio
information is extracted by dividing the audio
into frames, digitally processing each frame
to obtain its spectrum, and then extracting audio features
with a VGG network. Title information is
obtained by segmenting the title into words and mapping
each word to a word vector.

• For the micro-video classification task, we propose a new
combinatorial network model that combines the
discrete features of each modality into an overall feature
for that modality through the network, and then fuses
the modal features to obtain the overall video
feature, which is used for classification.

• To verify the effectiveness of the proposed algorithm,
we conduct experiments on a public dataset; the results
demonstrate the effectiveness of our model.
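
The audio step in the first contribution (framing the audio, then computing a spectrum per frame) can be illustrated with a minimal NumPy sketch. The frame length, hop size, Hann window and 16 kHz sample rate below are illustrative assumptions, not settings reported in this paper, and the VGG feature extractor that would consume the spectra is omitted.

```python
import numpy as np

def frame_spectra(signal, frame_len=400, hop=160):
    """Split a 1-D audio signal into overlapping frames and return
    the magnitude spectrum of each frame (one row per frame).

    frame_len and hop are hypothetical values chosen for illustration.
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hanning(frame_len)  # taper each frame to reduce spectral leakage
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # Real-input FFT along the time axis of each frame
    return np.abs(np.fft.rfft(frames, axis=1))

# Toy input: one second of a 1 kHz tone sampled at an assumed 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = frame_spectra(tone)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sr / 400  # bin index -> frequency in Hz
```

Each row of `spec` could then be stacked into a spectrogram patch and passed to an image-style network, which is the role the VGG network plays in the description above.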


Fig. 1: Schematic diagram of multimodal. (Block labels recovered from the figure: Filter, Concat, Cluster.)


2 RELATED WORK 


2.1 Personalized recommendation system 


Research on personalized recommendation systems
has a relatively short history, dating back
to the 1990s, but this has not limited the
impact of personalized recommendation systems
on human life. Today they have penetrated into many
aspects of our lives, such as e-commerce, news, travel,
music and entertainment [23]. With
personalized recommendation technology, users do not
need to spend additional effort selecting content, but
simply choose the relevant interest categories, and the
recommendation system will “tailor” the content to the
user’s preferences within a few seconds. During
browsing, the system collects the user’s browsing
history and analyzes it to understand the user’s interest
distribution, so as to achieve the purpose of “tailoring” [24].
Content-filtering-based recommendation first mines
the user’s history, such as browsing records and the
reviews or descriptions the user has written about items,
and then uses the TF-IDF [25] algorithm
to determine the importance of words.
Collaborative-filtering-based recommendation systems
are now used in many fields,
for example Amazon’s recommendations in e-commerce,
Instagram and TikTok among micro-video apps, and
Google’s recommendations in search. The
key idea is to use the acquired data to find
users whose preferences are similar to those of the target
user, calculate the similarity,
generate the set of neighbors, and finally
make recommendations based on the preferences of
the neighbors with higher similarity
[26, 27, 28]. The core idea of the
user-based collaborative filtering algorithm [29] is to
compute and infer whether there
is some connection between users. This connection
can be used to find shared interest preferences
between two users, and items of
interest are then recommended to the target users who share
that connection. The more data is available,
the more accurate the final recommendation tends to
be [30]. The core idea of the item-based collaborative
filtering algorithm is to compute
whether there is some connection between items,
and then use this connection to make effective and
reasonable recommendations [31].
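
The user-based collaborative filtering steps described above (compute user similarity, form a neighbor set, recommend from the neighbors' preferences) can be sketched as follows. This is a minimal illustration that assumes cosine similarity and a similarity-weighted sum of neighbor ratings; the rating matrix, `k` and `top_n` are hypothetical, and the cited works may use different similarity measures.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two rating vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def recommend_user_based(ratings, user, k=2, top_n=2):
    """Recommend unseen items for `user` from the k most similar users."""
    n_users = ratings.shape[0]
    # Similarity of the target user to every other user (self is excluded)
    sims = np.array([
        cosine_similarity(ratings[user], ratings[v]) if v != user else -1.0
        for v in range(n_users)
    ])
    neighbors = np.argsort(-sims)[:k]            # the neighbor set
    scores = sims[neighbors] @ ratings[neighbors]  # similarity-weighted ratings
    scores[ratings[user] > 0] = -np.inf          # skip items already rated
    ranked = np.argsort(-scores)[:top_n]
    return [int(i) for i in ranked if np.isfinite(scores[i])]

# Toy user-item matrix: rows are users, columns are items, 0 means "not rated"
ratings = np.array([
    [5, 4, 0, 0, 0],
    [5, 5, 4, 0, 0],
    [0, 0, 0, 5, 4],
    [4, 5, 5, 0, 1],
], dtype=float)

recs = recommend_user_based(ratings, user=0)
```

The item-based variant follows the same pattern with the matrix transposed, so that similarity is computed between item columns instead of user rows.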


2.2 Multimodal recommendation 


Co-training is a classical multi-view learning method for
cases where the dataset contains a
large number of unlabeled samples, and belongs to the
category of semi-supervised learning in machine learning.
However, in real research problems, the requirement of
fully redundant views is difficult to satisfy.
Nigam K et al. [32] conducted an experimental study on
the performance of co-training algorithms on problems that
do not have fully redundant views. Goldman S et al. [33]
proposed a co-training algorithm that does not require fully
redundant views. Two different classifiers are trained on
the same attribute set. In the training phase, each
classifier labels the unlabeled examples on which it has
higher confidence and passes them to the other classifier for
learning; in the testing phase, the labeling confidence of the two
classifiers on the test example is first estimated and the one
with higher confidence is selected for the final prediction.
They later extended the algorithm so that it can
use several different kinds of classifiers. To further relax the
constraints of co-training, Zhou ZH et al. [34] proposed a
tri-training algorithm that requires neither fully redundant
views nor different types of classifiers.
Subspace learning is an important research area in machine
learning that has attracted wide attention from researchers, and
its results have been applied in various specific
fields.

Among recent studies, Liu Meng et al. [35] proposed a
Joint Sequential-Sparse approach, in which short videos are
processed frame by frame, features are extracted with LSTM
for the visual, acoustic and textual modalities respectively, and
the obtained features are mapped to the same space. Finally,
a CNN was used for dictionary learning to obtain
sparse encodings of the visual, acoustic and textual features,
which were used to classify the
short videos, with some success. Bo Peng
et al. [36] proposed an unsupervised clustering algorithm
based on motion scenes, which combines static scene
features and dynamic features. A model based on contextual
constraints was established. First, the complementarity
of multi-view subspace representations in each context
is explored through single-view and multi-view constraints.
Then, the association matrix of contextual
constraints is computed and MSIC is introduced to mutually
regulate the inconsistency of subspace representations in scenes
and motions. Finally, an overall objective function is
constructed to guarantee the video motion clustering results
by jointly constraining the complementarity of multiple
views and the consistency of multiple contexts. Jun Yang
et al. [37] proposed a method to identify the behavior of
construction site employees using semantic information.
A nonparametric data-driven scene analysis approach was
used to identify the constructed objects. A context-based
action recognition model was learned from the training
data. Then, action recognition was improved by using
the identified constructed objects. Jingyi Hou et al. [38]
proposed a decomposed action scene network, FASNet.
FASNet consists of two parts. One is an attention-based
CANET network, which encodes local spatio-temporal
features to learn features with good robustness.
The other part is a fusion network, which mainly
fuses spatio-temporal features and contextual features to learn more
descriptive feature information.


3 METHODOLOGY 
3.1 Feature Extraction 


Feature extraction is divided into scene feature extraction
and behavior feature extraction. First, scene features are
extracted for micro-videos. Let the micro-video sequence be
$V = \{v_1, v_2, \ldots, v_n\}$, where $n$ is the number of
micro-videos. The scene features are extracted using a deep fusion
network based on VGGNet. The global features in the scene
are learned and extracted using the VGGNet16 network,
the local detailed features in the scene are learned
and extracted using VGGNet19, and the learned features
are then fused. VGGNet is used because
the network adopts 3 × 3 convolutional kernels,
which reduce the number of parameters, and
stacking small convolutional layers provides
multiple nonlinear transformations and a better ability to
learn features. Assuming that the number of scene categories
is $N_s$, for the $i$-th micro-video, the purpose of scene
recognition is to find the maximum value of the scene
prediction probability as follows:

$$s_{v_i} = \max\{p_{v_i}^{s_j}\} \quad (1 \le j \le N_s,\ 1 \le i \le n), \tag{1}$$

where $v_i$ is the $i$-th micro-video and $p_{v_i}^{s_j}$ is the probability
value of the $i$-th micro-video corresponding to the $j$-th scene.
Here, however, it is necessary to keep the probability values of $v_i$
for all scenes, to retain as much useful information in the
video as possible:

$$f_{v_i}^{s} = \{p_{v_i}^{s_j}\} \quad (1 \le j \le N_s,\ 1 \le i \le n). \tag{2}$$


Assuming that the number of behavior categories is $N_a$,
for the $i$-th micro-video, the result of behavior recognition
can be defined as:

$$a_{v_i} = \max\{p_{v_i}^{a_k}\} \quad (1 \le k \le N_a,\ 1 \le i \le n), \tag{3}$$

where $v_i$ is the $i$-th micro-video and $p_{v_i}^{a_k}$ is the probability
value of the $i$-th micro-video corresponding to the $k$-th
action. Furthermore, that can be defined as:

[Equation (4): an expression involving RGB and Flow terms; not recoverable from the source.]

Similarly, the behavioral feature extraction part, which
needs to keep the probability values of each micro-video for
all behaviors, is defined as follows:

$$f_{v_i}^{a} = \{p_{v_i}^{a_k}\} \quad (1 \le k \le N_a,\ 1 \le i \le n). \tag{5}$$


Having obtained $f_{v_i}^{s}$ and $f_{v_i}^{a}$, for the $i$-th micro-video, the
joint feature is defined as:

$$f_{v_i} = (f_{v_i}^{s})^{T} f_{v_i}^{a}, \tag{6}$$

where $f_{v_i}^{s}$ and $f_{v_i}^{a}$ are the scene features and behavioral features
of $v_i$, respectively, and the dimension of $f_{v_i}$ is $N_s \times N_a$.
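
As a small worked example of the joint feature, the sketch below assumes that the $N_s \times N_a$ feature is the outer product of the scene probability vector and the behavior probability vector, which matches the stated dimension. The probability values are made up for illustration, and the function name `joint_feature` is hypothetical.

```python
import numpy as np

def joint_feature(scene_probs, action_probs):
    """Outer product of scene and behavior probabilities: shape (N_s, N_a)."""
    return np.outer(scene_probs, action_probs)

# Hypothetical per-video outputs of the scene and behavior branches
scene_probs = np.array([0.7, 0.2, 0.1])   # N_s = 3 scene categories
action_probs = np.array([0.6, 0.4])       # N_a = 2 behavior categories

f = joint_feature(scene_probs, action_probs)

# The recognized scene and behavior are the arg-max entries, but the full
# probability vectors are kept so that no information is discarded.
scene_label = int(scene_probs.argmax())
action_label = int(action_probs.argmax())
```

Under this assumption, entry `f[j, k]` is large only when scene `j` and behavior `k` are both likely for the video, so the joint feature encodes scene-behavior co-occurrence.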

Fig. 2: Overall schematic diagram of our model.


3.2 Attention mechanism 


The main modalities present in micro-video are visual,
audio and text. Most video processing algorithms
currently rely mainly on visual features.
Audio and text features are used less often in computer vision,
although they also carry a lot of video-related
information. Fusing several modalities can therefore improve
the accuracy of video classification. Features are extracted
from the visual, audio and title information of the video
separately and then fused, and the final prediction score
is obtained by the classification function. There are two
main feature fusion modes: forward (early) fusion and
backward (late) fusion. In this paper, we adopt backward
fusion: we first input each modality into the
clustering network to obtain the corresponding features, and
then concatenate them to obtain the final feature vector.

Assume that the video has $M$ frames, and the feature
descriptor $x_i$ of each frame is $N$-dimensional. There are $K$
cluster centers, and each frame is first encoded into an
$N \times K$ feature as follows:

$$v_{ijk} = a_k(x_i)\,(x_{ij} - c_{kj}), \tag{7}$$

where $c_k$ is the $N$-dimensional coordinate vector of
cluster center $k$, and $a_k(x_i)$ is a soft-assignment function that
measures the similarity of $x_i$ to cluster center $k$. The similarity
function is usually computed with a fully connected layer
followed by a Softmax activation:

$$a_k(x_i) = \frac{e^{w_k^{T} x_i + b_k}}{\sum_{s=1}^{K} e^{w_s^{T} x_i + b_s}}. \tag{8}$$

The video-level feature descriptor $y$ is then obtained by
aggregating all the frame-level features and is expressed by
the following equation:

$$y_{jk} = \sum_{i=1}^{M} v_{ijk}. \tag{9}$$

The set of local features is defined as the unordered
features obtained in different segments of the same video,
and the $L \times M$ dimensional matrix $X$ is used to denote $L$
local features, each row being a separate local feature vector:

$$X = (x_1, x_2, \ldots, x_L). \tag{10}$$
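
The cluster-based aggregation of Eqs. (7) to (9) can be sketched in NumPy as below. This is a minimal illustration under stated assumptions: the weights `W`, bias `b` and cluster centers are random placeholders rather than learned parameters, and the function names are hypothetical.

```python
import numpy as np

def soft_assign(X, W, b):
    """Softmax similarity of each frame to each cluster center, Eq. (8).

    X: (M, N) frame features, W: (N, K) weights, b: (K,) biases.
    Returns an (M, K) matrix whose rows sum to 1.
    """
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def aggregate(X, centers, W, b):
    """Video-level descriptor y of shape (N, K), Eqs. (7) and (9)."""
    A = soft_assign(X, W, b)                           # a_k(x_i), shape (M, K)
    residuals = X[:, :, None] - centers.T[None, :, :]  # x_ij - c_kj, shape (M, N, K)
    V = A[:, None, :] * residuals                      # v_ijk
    return V.sum(axis=0)                               # sum over the M frames

# Placeholder sizes and random parameters, for illustration only
rng = np.random.default_rng(0)
M, N, K = 8, 4, 3
X = rng.normal(size=(M, N))
centers = rng.normal(size=(K, N))
W = rng.normal(size=(N, K))
b = np.zeros(K)

Y = aggregate(X, centers, W, b)
```

Because the sum over frames is order-independent, the descriptor `Y` is the same for any permutation of the frames, which is why the local features can be treated as an unordered set.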


Key frames are selected using the attention
mechanism and combined into global features. In the
classification task, attention is static, and the input contains
only the local feature vectors themselves. The first step is to analyze
the importance of each feature and then boost the weight of
key features as much as possible, while ignoring irrelevant
features and noise. The attention output can be regarded
as a weighted sum of the local feature vectors:

$$v = aX, \tag{11}$$

where $a$ is an $L$-dimensional weight vector, calculated using
the weight function.

Choosing an appropriate weight function is important.
Its input is the set of local features $X$ and its output is the weight
vector $a$, whose L1 norm is normalized to 1. Each dimension of
the weight vector corresponds to a local feature. There are
several ways to compute the weights of local features.
For example, global average pooling can be regarded as a degenerate form
of the attention mechanism, whose weight
function is:

$$a = \frac{1}{L}\mathbf{1}, \tag{12}$$

where $\mathbf{1}$ is an $L$-dimensional vector with all elements equal to 1. To
obtain a more flexible attention weight function, a fully connected
layer is used to learn the weight coefficients:

$$a = \mathrm{softmax}(wX^{T} + b), \tag{13}$$

where $w$ and $b$ are vectors of $M$ and $L$ dimensions,
respectively.
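
The attention weighting of Eqs. (11) and (13), together with the pooling special case of Eq. (12), can be illustrated with a short NumPy sketch. The local features and the parameters `w` and `b` below are random placeholders standing in for learned values, and the function names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_pool(X, w, b):
    """Weighted sum of the L local features (rows of X), Eqs. (11) and (13).

    X: (L, M) local features, w: (M,) weights, b: (L,) biases.
    Returns the pooled vector v of shape (M,) and the weights a of shape (L,).
    """
    a = softmax(w @ X.T + b)
    return a @ X, a

def average_pool(X):
    """Uniform weights a = 1/L for every local feature, as in Eq. (12)."""
    L = X.shape[0]
    a = np.ones(L) / L
    return a @ X, a

# Placeholder data: L = 4 local features of dimension M = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
w = rng.normal(size=3)
b = np.zeros(4)

v, a = attention_pool(X, w, b)
v_avg, a_avg = average_pool(X)
# With zero weights and biases the softmax is uniform, so attention
# reduces to average pooling.
v_zero, a_zero = attention_pool(X, np.zeros(3), np.zeros(4))
```

The last call shows why pooling is a special case of attention: when the learned layer carries no information, every local feature receives the same weight $1/L$.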


4 EXPERIMENTS 
4.1 Dataset 


The data used for the experiments is a publicly available
dataset from TikTok. The dataset contains about 1.5 million
records, covering 2000 users’ viewing records of 12,000
micro-videos. The original dataset is randomly divided into
a training set and a test set, accounting for 80% and 20%,
respectively.


4.2 Baselines 


FM [39]: Factorization Machine, which models first-order
feature importance and second-order feature interactions.

DeepFM [40]: DeepFM is an end-to-end model that combines
a factorization machine with a multilayer perceptron. It uses
deep neural networks and factorization machines to model
the interactions of higher-order features and lower-order
features, respectively.

VBPR [41]: The model integrates visual information into
the prediction of people’s preferences, and is a
significant improvement over matrix factorization models
that rely only on user latent vectors and item latent
vectors.

Fig. 3: Test set top-5 result.
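
To make the FM baseline concrete, the sketch below computes the standard second-order factorization machine prediction, $\hat{y}(x) = w_0 + \sum_i w_i x_i + \sum_{i<j} \langle v_i, v_j \rangle x_i x_j$, using the usual linear-time reformulation of the pairwise term. The parameters are random placeholders, not values fitted in the experiments of this paper.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order factorization machine prediction.

    x: (n,) feature vector, w0: scalar bias, w: (n,) first-order weights,
    V: (n, k) factor matrix whose row i is the latent vector of feature i.
    """
    linear = w0 + w @ x
    xv = V.T @ x                    # sum_i v_if * x_i for each factor f
    x2v2 = (V ** 2).T @ (x ** 2)    # sum_i v_if^2 * x_i^2 for each factor f
    pairwise = 0.5 * np.sum(xv ** 2 - x2v2)
    return float(linear + pairwise)

# Placeholder parameters, for illustration only
rng = np.random.default_rng(0)
n, k = 5, 3
x = rng.random(n)
w0 = 0.1
w = rng.normal(size=n)
V = rng.normal(size=(n, k))

y = fm_predict(x, w0, w, V)
```

The first-order term captures feature importance and the pairwise term captures second-order interactions, which are the two components named in the FM description above.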


4.3 Metrics 


Many different criteria exist for evaluating the
quality and efficiency of recommendation systems. In this
paper, we use Precision and Recall to judge the efficiency
and effectiveness of the recommendation function:

$$\mathrm{Precision} = \frac{\sum_{u \in U} |R(u) \cap S(u)|}{\sum_{u \in U} |R(u)|}, \tag{14}$$

$$\mathrm{Recall} = \frac{\sum_{u \in U} |R(u) \cap S(u)|}{\sum_{u \in U} |S(u)|}, \tag{15}$$

where $R(u)$ denotes the final result recommended to the
user $u$, and $S(u)$ denotes the data source generated by all of
the user’s actions.
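
The two metrics can be computed directly from per-user sets, as in the minimal sketch below. The user and item identifiers are made up for illustration, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(recommended, relevant):
    """Precision and Recall aggregated over all users.

    recommended: dict mapping user -> list of recommended items, R(u)
    relevant:    dict mapping user -> set of items from the user's actions, S(u)
    """
    users = set(recommended) | set(relevant)
    hits = sum(
        len(set(recommended.get(u, [])) & set(relevant.get(u, set())))
        for u in users
    )
    n_recommended = sum(len(set(recommended.get(u, []))) for u in users)
    n_relevant = sum(len(set(relevant.get(u, set()))) for u in users)
    precision = hits / n_recommended if n_recommended else 0.0
    recall = hits / n_relevant if n_relevant else 0.0
    return precision, recall

# Toy example with two users
recommended = {"u1": ["a", "b", "c"], "u2": ["d", "e"]}
relevant = {"u1": {"a", "c", "x"}, "u2": {"e"}}

precision, recall = precision_recall(recommended, relevant)
```

In this toy case three of the five recommended items are hits and three of the four relevant items are recovered, so Precision is 0.6 and Recall is 0.75.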


4.4 Result Analysis 


During the learning process, several experiments were
conducted at 100, 200, 500, 1000, 2000 and 5000 iterations,
and the average values were obtained. The top-5
accuracy results for the different iteration counts were then
compared, and the results are shown in Figure 3. The figure
shows that the top-5 accuracy is 0.33 when the
number of iterations is 1000, but more iterations
are not necessarily better: the accuracy tends to decrease
when the number of iterations increases from 1000 to 5000.

As the figure clearly shows, the recommendation
algorithm proposed in this paper (Ours) outperforms
FM, DeepFM and VBPR in micro-video
recommendation. We first compare the proposed model
with FM, a common item-based
collaborative filtering model: the former focuses
on the relationship between users, while the latter focuses
on the relationship between micro-videos. Micro-videos
carry little information themselves, vary in quality, and the
relationship between videos uploaded by the same uploader is
misleading and hard to distinguish, which makes it difficult
to find relationships between videos. It is
significantly easier to analyze the relationship between
users, since this only requires finding the user groups that
are similar to the target user.
We also compare the proposed model with
DeepFM. The uncertain neighborhood collaborative
filtering algorithm is one of the best-known algorithms; it
balances the relationship between videos and users by
adding an uncertainty factor. However, because the
ratings of micro-videos are sparse and it does not fill the data
matrix by other means, recommendation
quality decreases. The model in this paper, by extensively
analyzing the explicit and implicit behaviors of users, greatly
expands the data matrix and ensures the quality of recommendations.


5 CONCLUSION AND FUTURE WORK 


With the advent of Convolutional Neural Networks (CNN),
researchers no longer need to design feature descriptors
manually; networks automatically learn semantic features
and understand image content, and have achieved
great success in image classification, detection and retrieval.
Deep-learning-based micro-video recognition
has surpassed the iDT algorithm, so such
traditional methods have gradually faded from view.
Facebook further proposed using 3D convolutional neural
networks to extract spatio-temporal information, which
injected new vitality into the research of video classification.

Researchers have also tried to apply recurrent neural
networks (RNNs) to video temporal modeling. Such methods
usually first extract video frame
features using CNNs and then input these features into
RNNs in temporal order for temporal modeling. LSTM
is a commonly used RNN model for
modeling long-term dependencies in video. For example, Ng
et al. used LSTM to fuse video frame and optical flow
features, and experimentally verified the robustness of
LSTM networks to optical flow noise and the effectiveness
of LSTM for fusing video sequence features. Video contains
rich multimodal information, and fusing information from
multiple modalities can
improve the accuracy of video classification. In this
paper, we studied methods for extracting visual, audio and
title features from videos. Visual information
is extracted by dividing the video into multiple
frames and passing each frame through a pre-trained
network. Audio information
is extracted by dividing the audio into frames, digitally
processing each frame to obtain its spectrum, and
then extracting audio features with a VGG network.
Title information is obtained by segmenting the title into
words and mapping each word to a word vector.
For the micro-video classification task, we proposed a
new combinatorial network model that combines
the discrete features of each modality into an overall
feature for that modality through the network, and
then fuses the modal features to obtain the overall
video feature, which is used for classification.
Experiments on a public dataset
demonstrate the effectiveness of our model.